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Abstract 

We study the group test for DNA library screening based on probabilistic approach. Group test is a 
method of detecting a few positive items from among a large number of items, and has wide range 
of applications. In DNA library screening, positive item corresponds to the clone having a specified 
DNA segment, and it is necessary to identify and isolate the positive clones for compiling the libraries. 
In the group test, a group of items, called pool, is assayed in a lump in order to save the cost of testing, 
and positive items are detected based on the observation from each pool. It is known that the design 
of grouping, that is, pooling design is important to achieve accurate detection. In the probabilistic ap- 
proach, positive clones are picked up based on the posterior probability. Naive methods of computing 
the posterior, however, involves exponentially many sums, and thus we need a device. Loopy belief 
propagation (loopy BP) algorithm is one of popular methods to obtain approximate posterior proba- 
bility efficiently. There are some works investigating the relation between the accuracy of the loopy 
BP and the pooling design. Based on these works, we develop pooling design with small estimation 
bias of posterior probability, and we show that the balanced incomplete block design (BIBD) has nice 
property for our purpose. Some numerical experiments show that the bias correction under the BIBD 
is useful to improve the estimation accuracy. 

Keywords: Group test; Pooling design; Loopy belief propagation; BIB Design. 



1 Introduction 

We study the group test based on a probabilistic approach. Group test is a method of detecting positive 
items out of a set of a large number of items, and has wide range of applications such as blood test or 
DNA library screening. 
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In the context of DNA library screening, our purpose is to identify clones having a specified DNA 
fragment from among a collection of DNA segments. Each DNA segment is called clones. The clone 
with a specified segment is referred to as positive clone, otherwise negative clone. For large libraries, 
it is impractical to screen each clone individually, instead a group of clones, called pool, is assayed 
in a lump. This is said to be group test or pooling experiment. When a pool gives positive result, the 
pool contains at least one positive clone, and otherwise all clones are negative. A number of pools are 
prepared, and outcomes from all pools are assembled to identify positive clones. 

There are mainly two categories of group test; one is adaptive, and the other is non-adaptive. In 
adaptive strategy, the pool is sequentially prepared and the test is conducted based on the information 
of previous outcomes. By repeating the test procedure, we can narrow down the set of positive 
clones. In non-adaptive testing, we prepare all pools to be tested before conducting the group test. 
The positive clones are detected based on the outcome of each pool. That is, the grouping of clones 
does not depend on the result of previous testing. When the group test for each pool is performed by 
distinct experimenters, non-adaptive method may not be time-consuming compared to adaptive one. 
In this article, we focus on non-adaptive testing. 

In group testing, we have two kind of detecting procedure; one is combinatorial and the other is 
probabilistic. In combinatorial group testing, the main issue is to construct the design of grouping or 
pooling design to reduce the number of testing without missing the positive clones. Combinatoria l 
group testing has been studied by many authors (|Du and Hwangl . ll999l ; lNgo and Dull2000l ; IWu et al 



lup 

m 



2004f) . In combinatorial approach, it is often assumed that the maximum number of positive clones 



is known and that there is no observation errors or noisy measurements. On the other hand, in prob- 
abilistic approach the prior probability for the state of clones is assumed, and po sterior probability 
such that each clone is positive is computed based on the observation of ea ch pool (|Knill et al.L 1 19961 ; 



Bruno et al.L 1 19951 ; iMezard and Toninellil 120071 ; lUehara and Jimbol 120091) . The main issue is to de- 
velop efficient algorithm to compute th e posterior probability, since using n aive Bayes formula is 
computationally de manding. Knill et ajj (Il996j) and lUehara and Jimbol (l2009|) have proposed a prob- 
abilistic algorithm. iKnill et all (1996) have use d the Markov Chain Mont e Carlo (MCMC) method 
to obtain the marginal posterior pr obability, and lUehara and Jimbol (|2009|) have exploited the loopy 
belief propagation (BP) algorithm (jPearll . 1 19881 ; iMacKayl . 1 19991) to compute approximate probability. 

Non-adaptive group test with probabilistic approach will be one of the most practical methods to 
detect positive clones from among large DNA library. Even in probabilistic approach, the pooling 
design is significant to achieve highly accurate estima tion of posterior probability. In loopy B P algo- 
rithm for the low density parity check (LDPC) coding (iMacKayll 19991; iRichardson et al.Ll200lb . it has 
been revealed that the coding design is closely related to the decoding error of the transmitted code. 
Likewise, the pooling design with some nic e property will provide a ccurate estimator o f the poste 



rior pro bability as experimentally shown by lUehara and Jimbol (|2009[) . In coding theory. Ilkeda et al 



(|2004ah have analyzed the relation between the coding design and the bias of the estimated posterior 
probability. We apply their result to improve the accuracy of the group testing. 

The outline of the paper is as follows. In Section [2] probabilistic description of group testing for 
DNA library screening is presented. In Section [3] we introduce loopy belief pr opagation algorithrn , 
and in Section|4]we show the bias the estimated posterior probability according to llkeda et al.l(|2004ar) . 
In Section [51 we construct a pooling design resulting in a small bias. Numerical experiments are 
presented in Section |6l Section |7] is devoted to concluding remarks. 
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2 Preliminaries of DNA library screening 



On DNA library screening, our purpose is to identify the positive clones out of a large DNA library. 
Let Xi be the random variable which stands for the label of the clone i for i = 1, . . . , n, that is, 
Xi = 1 for positive and Xj = for negative. The labels of all clones are denoted as the vector 
X = (Xi, . . . , X„). We assume that the random variables X^, i = 1, . . . , n are independent. The 
probability such that X = x E {0, 1}" for x = (xi, . . . , x„) is denoted by p{x) or p{X = x). Then, 
the probability p{x) is represented by the factorization of marginal probabilities, that is 

p(x) = Pl{Xi) X ■ ■ ■ X PniXn). 

Since the marginal distribution pj(xi) over {0, 1} is written as the form of exponential model pi(xj) oc 
exp{hiXi}, the joint probability p(x) is given as 

p{x)=exp{h^x-Mh)}, xG{0,l}'^ 

with h = {hi, . . . , hn) E M", where i'oih) is the normalization factor called the cumulant generating 
function. 

In the group test a number of clones are set in a pool and the experiment is conducted to detect if 
a positive close is included in the pool. Here the pool is identified by a subset of {1, ... , n}, and the 
clone i is included in the pool r if and only if i E r holds. For the pool r C {1, . . . , n}, let Zj. be the 
random variable defined by 

f 1 Er, Xi = l, 
I otherwise. 

Hence if Zr = 1, there is a positive clone in the pool r. Note that Z^ is also represented as 

Zr = maxXi = 1 - TT(1 - Xi). 

i Gr- 
in practice, Zr is not directly observed. The observation of the pool r is usually represented by four 
levels such that 

' if the pool r is negative, 

1 if the pool r is weak positive, 

2 if the pool r is medium positive, 
^ 3 if the pool r is strong positive. 

The response of the experiment is measured by using a fluorescence sign, and it is experimentally- 
confirmed that the conditional probability of 5*^ given Xi,{i E r) only depends on Z^, not the number 
ofiEr such that Xi = 1. We assume that the conditional probability of Sr given Zr is the same for all 
pools. Then, the conditional probability of Sr = Sr given Zr = Zr is denoted as p{Sr = Sr\Zr = Zr) 
oxp{sr\zr)- In practicc j9(S'r. = Q\Zr = 0) and p{Sr = 'i\Zr = 1) will take larger value than others. In 
the group test usually we prepare a number of pools. Let Q = {ri, . . . , r^} be the set of pools used 
in the group test. Then for each pool r E Q the observation Sr E {0, 1, 2, 3} is obtained. An example 
of a pooling design is shown in Figure [IJ 

The problem considered in the paper is to infer the label of clones based on the observation from 
each pool. More precisely, we want to pick up only positive clones out of all clones. As a probabilistic 
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Figure 1: An example of a pooling design. Q is given as {{1, 2}, {1, 3}, {2, 3, 4}}. 



approach, the method of maximum a posteriori (MAP) estimate is useful to detect the positive clones. 
Let S = {Si, . . . , Sm) be the random variable for the observation from all pools, and 

p{X = x\S = s) = p{x\s) 

be the posterior probability of X = x = (xi, . . . , Xn) given S = s = {si, . . . , Sm), where m is the 
number of pools. The label pattern x G {0, 1}"^ maximizing the posterior p{x\s) will provide the set 
of clones which are likely to be positive. 

We represent the posterior p(x\s) by p{x) and p{s\x). Using the Bayes formula, we can represent 
the posterior probability p{x\s) as 

p{x\s) oc p{s\x)p{x). 

For the distinct pools r, r' G G, the observations Sr and Sr' are conditionally independent for given x. 
Hence the probability p{s\x) is decomposed into the conditional probabilities of r G ^, and then we 
have 

For each observation s G {0,1,2,3}, the conditional probability p{Sr = s\x) is written as 

p{Sr = s\x) OC exp{c(s, x)} 

as the function of s, where c(s, x) is a real-valued function. When we compute the posterior proba- 
bility of p{x\s), the observations Sr,r E Q are regarded as constants, and thus c{sr,x) is written as 
Cr{x) as the function of the label pattern x G {0, 1}". Note that Cr{x) depends only on Zr which is a 
realized value of Zr defined in ([T])- Then, the posterior probability p{x\s) is given as 

p{x\s) oc exp 

reg 

Suppose that the parameter h and the functions Cr{x), r E G are known or these are estimated with 
satisfactory accuracy. 
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Table 1: An example of marginal posterior probabilities. The pooling design in Figure [B and the 
observation probability p{Sr = s\x) shown in Table|3]are used, and the marginal probability p{Xi = 
1) is set to 0.1 for all clones. 



(Sl,S2,S3) E {0,1,2,3)3 


p,iX^ = l\s) 


P2iX2 = l\s) 


PsiX^ = l\s) 


PiiX^ = l\s) 


(3,0,0) 


0.043 


0.047 


0.001 


0.011 


(2,2,0) 


0.853 


0.019 


0.019 


0.009 


(0,1,3) 


0.020 


0.016 


0.760 


0.180 


(0,0,3) 


0.001 


0.027 


0.027 


0.429 



In general the maximization of p{x\s) in Q over x £ |0, 1|" is compu tationally hard unless the 



set of pools Q has some special property JPearj 1988 : Cowell et al. . 2007b . Thus, we take another 



approach. The marginal probability of Xj for p(x|s) is denoted as Pi{xi\s), that is 



Pi[Xi\s) 



p(x'| 



(3) 



a;'G{0,l}" ■.x'^=Xi 



We think that the clones having large marginal posterior Pi{Xi = l\s) will be positive. Using a 
threshold for the marginal posterior, we will be able to detect the set of positive clones. As an example 
Table [H shows exact marginal posterior probabilities Pi{Xi = l\s). The pooling design in Figure [T] 
and the observation probability p{Sr = s\x) shown in Table[3]are used, and the marginal probability 
p{Xi = 1) is set to 0.1 for all clones. We see that the marginal posterior will be useful to detect the 
positive clones. 

The computation of the marginal posterior is still hard, since there are exponentially many sum- 
mands in ([3]). Despite this, we can compute an approximate posterior probability by applying so- 
called loopy belief propagation (loopy BP) algorithm. The details of loopy BP is briefly introduced 
in Section [3l 

3 Loopy Belief Propagation for Computation of Marginal Probability 



Loopy belief propagation is a meth od of computing an approximat e marginal probability, which is 
very useful in stochastic reasoning ( Pearl . 1988 : Cowell et al.l 2007). Let q{x) be a joint probability 
of high dimensional binary variable X = (xi, . . . ,Xn) G {0, l}'^. In the group test g(x) corresponds to 
the posterior probability p{x\s). The computation of the marginal qi{xi) involves exponentially many 
sums. To reduce the computational cost, we approximate the joint probability q{x) by a tractable one. 
Suppose that q(x) is represented by the form of Q, that is, 



q[X) oc 



exp 



X 



reg 



.X 



)} 



and we use the model 



to approximate q{x), where 9 



q{x; 9) oc exp \^h^x + 9~'^x} 



(4) 



(5) 



9i, . . . ,9n)^ G M" is an 77, dimensional column vector. The param- 
eter 9 is determined such that the function 9^x is close to J2reg '^r{x) up to additive constant. Then, 
the marginal probability of q{x) will be approximately given by 



qi{xi] 9) oc exp [hiXi + 9iXi], 



n. 
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As a result, we can obtain an approximate value of the marginal probability for q{x). The loopy BP 
algorithm provides an efficient method of computing the parameter 9. Suppose that 9 is decomposed 
into the sum of parameters G M", r e G, that is, 

We suppose that the function Cr.{x) is approximated by ^Jx for each pool r E Q. When the parameters 
$,r, r E Q are obtained in mid-flow of the algorithm, we show how to update these parameters. Let (^r 
be defined as 

Cr ^ ^ 9 ^j.. 

seg 

When the function Cr{x) is approximated by $,Jx, the probability q{x) is also approximated by 
exp{h^x + (Jx + Cr{x)}. We seek the parameter ^j. such that exp{h^x + Cjx + ^Jx} approxi- 
mates exp{/i^x + Cjx + Cr{x)} up to the normalization constant. The KuUback-Leibler divergence 

KL(p,g)= ^ p(x)log 



q(x) 

XG{0,1}" 

is used as the discrepancy measure between two probabilities p and q over {0, 1}". We consider the 
following optimization problem: 

min KL(», q) 

subject to p{x) oc exp{h^x + (Jx + Cr{x)}, q{x) oc exp{/i'''x + Cj x + x}. 
By some calculation, we see that the above problem is represented as the following form: 



max exp{/i^x + C^^ x + Cr{x)} ■ < ^7^^ ~ ^og{l + exp{/ij + + ^«}) f • 

xG{0,1}" ^ i=l 



(6) 



There are 2l''l summands in the function to be optimized, where |r| denotes the number of elements 
in the set r. When the size of the pool r is not large, the objective function in the optimization 
problem above is tractable. The parameter is updated to which is the optimal solution of Q. 
In the same way, the parameters ,^,.,r G Q and the sum 9 = J2reg^'^ updated sequentially. 
The convergent point of 9 is the output of the algorithm, and we obtain the approximated marginal 
probability qi{xi;9),i = 1, . . . ,n. The loopy BP algorithm is very useful in practice, though the 
convergence property of the algorithm is not theoretically guaranteed under general condition. 

In the literature of DNA library screening, the function c,.(x) depends on the value of Zr = 
and thus we define 



CriX) 



Cj-0 Zj- = {). 



Then, the objective function of © h as a simple form, and the updated parameter in the loopy BP 



algorithm is explicitly obtained. See lUehara and Jimbd (|2009l) for details. 
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4 Bias of Loopy Belief Propagation 



According to llkeda et alJ (l2004ar) we show the bias introduced by the loopy BP algorithm in the 
general setup. For each pool r E G, let Cr(x) be any real- valued function depending only on Xj, i G r, 
and q(x) be a probability on {0, 1}" defined as the form of (H]). Let gj(xj) be the marginal probability 
of q{x). We use the statistical model ([5]) to approximate the joint probability q{x). Let q{x; Oq) oc 
exp{/i^x + Oqx} be the convergent joint probability computed by the loopy BP algorithm applied 
to q{x). Usually the estimated marginal qi{xi] 6q) is no t equal to the true m arginal probability qi{xi), 
and the difference qi{x,i) — qi{xi] 9q) is said to be bias, llkeda et alJ (l2004a[) have analyzed the bias of 
the loopy BP algorithm, and obtained the asymptotic formula such that 



- gi(l;6'o) = B^si + (higher order terms), 



(7) 



r,s£Q 



where B^si is related to a geometrical curvature of statistical model 6). 

To show the definition of Br si, we need to define the matrices gij,gir, Qir and the third order tensor 
T. Let Xi and Cr be the expectation of Xi and Cr{x) under q{x; 9q), that is 



Xi — ^ ] Xiqi{xi] 9q), 



^c,.(x)g(x;6'o). 



Note that the expectation Xi is equal to qi{l;9o), since Xi is the binary variable. The matrix gij, i,j = 
1, . . . , n is the Fisher information matrix of the model q{x; 9)at9 = 9q, 

9ij = - Xi){xj - Xj)q{x; 9o) = Sij ^(x^ - Xi)'^qi{xi; 9o) = 5ijXi{l - Xi), 

where 5ij is the Kronecker's delta function such that 5ij = 1 for z = j and otherwise 5ij = 0. Likewise 
the matrix for i = 1, . . . ,n, r G ^ is defined by 

9ir = - Xi){Crix) - Cr)q{x] 9o) , 

X 

and let be 'cjir = gir/ gu- Moreover let the third tensor T be 



Tijk 


= E 

X 


yXi 


- Xi] 




-Xj]{Xk - 


Xk]q{x;9o], i,j,k = 


1,.. 


. ,n, 


ijr 


X 




- Xi] 


(Xj 




- Cr]q{x;9o], ij = 


1,.. 


. ,n, r eQ, 


irs 


= E 




- Xi] 


(Cr( 


x] Cr] (Cg 


{x] - Cs]qix;9o], i = 


1,.. 


. , n, r,s eg 



Then Birs is defined as 



Br si Tirs ^ ^ Tij^g jrgks ~l~ ^ ^^ ^yTijrgjs ~l~ Tijsgji 

j,k=l j=l 



rj, i = 1, . . . ,n, r,s e Q. 



(8) 



Once we obtain the approximate joint probability g(x; ^o)^ we can compute Br si without knowing 
the target probability q{x]. Thus, according to (|7]) the bias is corrected by adding Ylr^s^-rsi/'^ to 

g.(i;^o). 
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5 Relation between Pooling Design and Bias of Loopy BP Algorithm 



We show some properties of Brsi defined in ([8]). Let Cr{x) be the function on {0, 1}" depending only 
on Xi,i E r, then Cr(x) is represented as the form of 



Cr{x) = hr + ^ bri - ttrii) 



This fact is shown below. Let r 
then we define Cr.(x) by 



{il, . . . C {1, 



hr, Ke e M, argi G [0, 1]. 
, n}, and x be x 



Xi , . . . , 



(9) 

G{0,1}I^I 



Cr{x) = 6s JJ (1 - {Xi^ - Xjf) = 

sG{o,i}i''i i=i xe{o,i}i 



\r -, \r 

hl[{2xk-l) l[{x,^-{l-x,)). (10) 
J j=i 



k=l 



The function Cr{x) has the form of and the variable x G {0, 1}"^ satisfying Xi. = Xj for j = 
1, . . . , |r| is mapped to 6j. G M. By varying any function over {0, 1}" can be represented by the 
form above. Though the parameter a^a in ^ can be restricted to the binary set {0, 1}, we allow the 
mild condition a^i G [0, 1] for convenience. 

Example 1. For the group test 

Cr{x) = pr ■ maXXi = Pr{l - (-l)'""' TT(Xi - 1)} 

i&r 



is used. For the low density parity check (LDPC) codes the function 

Crix) = Pr ■ 11(1 - 2Xi) = Pr{-2f\ X{{Xi - 1/2) 

is exploited. In the above, the coefficient pr determines the intensity contributed from the pool r E Q. 

First, we show the condition that the bias vanishes. 

Theorem 1. Let Cr be real-valued function over {0, 1}" depending only on the variables Xj, i E r. 
Let r, s be distinct subsets . . . , n}. Then, for any functions Cr{x), Cs(x) and any i = 1, . . . ,n, 
Brsi vanishes if\r fl s\ < 1. 



Theorem [T] is a direct conclusion of Theorem 7 in llkeda et al.l (|2004ah . The proof is deferred 
to appendix lAl to show the explicit form of Brsi- Let the packing design be the family of sets Q 
satisfying |r fl s| < 1 for any r,s G Q, then Theorem [T] denotes that for the packing design the 
dominant bias term of loopy BP alg orithm vanishes. The packi ng design is use d in the design of 
group test dUehara and Jimbol l2009h and also in the LDPC code (|MacKayl Il999|) . It is numerically 
shown that the accuracy of approximate probability is superior to other designs with |r n s| > 2. In 
coding theory, lots of designs of low density parity check (LDPC ) code have been intensively studied , 
and the packing design is known as good error-correcting code (|lkeda et al.L l2004al : iMacKayl . Il999r) . 
In Theorem [T] these results are extended to any function Cr,r G Q. 

We consider the bias term Brsi for |r fl s| > 2. 
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Theorem 2. Let Cr and Cg be functions with the form of and suppose that there exists a constant 
C such that the coefficients bri, hgi satisfy 

I Kebsi' I < C. 

I/' 

Let Xi be the expectation of Xi under the probability q{x; 6q) oc exp{h^x + Ojx} and S be a real 
number satisfying 

< |a;j — ttriil < S < 1, < |xj — agul < ^ < 1, 
for any i,i,r, s. Then, the intensity of B^si is bounded above as follows: 

^|r| + |s|-2 / 2 \ l""^'*' 



l^-l<^-^— l^l + i^J (11) 

The proof is shown in appendix |Bj It is easy to see the right-hand of (fTTl) is increasing function of 
6>0. 

Example 2. The bias term in the group test is shown. The function Cr is defined as pr{l — Y[i£ri^~^i)) 
as shown in ExampleUl Suppose Xi = x holds for all i = 1, . . . ,n. Then the bias term Br si for i E r\s 
is given as 

X \ , , X 



\Brs^\ = \PrPs\x{l " - xf^ + \'^-' Ul + ^ j - 1 - \r H S 



1 — X 



< iPrPsl 1 



V 4(1 

The bias B^si for i E r Ci s is also computed in the same way. It is verified that \ B^si \ vanishes for 
\r n s\ < 1. When \r\ and \s\ are fixed, minimization o/|r fl s| will contribute to the reduction of the 
bias. 

Examples. LetXi = xfori = 1, . . . ,n. FortheLDPC, the function Cr{x) = p,.(— 2)l''l Y[i£r(-'^i~^/'^) 
is used. Then, \Brsi\for i E r\s is given as 

\rns 

(x- 1/2)2 J - r • ■ "I _ 1/2)2 

\2x - l|l''l+l*l^2 / ]_ \ \rns\ 



\Brs.\ = 2\prPs\xa - X)|2X - l|'-' + l^''l /^^^ Y" ^ - 1 - W H s\ "^^^ 



< 



PrPs 



(2x - 1)2 



It is verified that \Brsi\ vanishes for |r fl s| < 1. When x ^ 1/2, the bias \Brsi\ is increasing in |r fl s| 
when the size of pools \r\ is fixed. Thus minimization o/|r fl s| is important to reduce the bias. 

The dominant bias is represented as the sum of B^si- We assume that the constants 5 and C in 
Theorem[2]are also upper bounds for any pair of r, s e Q. Suppose that the size of subset is fixed, i.e. 
\r\ = d, and let m = Then, an upper bound of the bias is given as 

max |r n s| 



1^ Cm{m - 1) 2d-2 A ^ 1 V'^^^^^ 
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Suppose that C does not significantly depend on the pooling design. Then, the pooling design mini- 
mizing max^^-r^s \rr\s\ will lead a small estimation bias when we use loopy BP algorithm to compute 
the approximate posterior probability. In the group test C is almost independent of the pooling de- 
sign, when the size of the pool, |r|, is fixed. Indeed we can choose C = maxr,^ \PrPs\, where pr is not 
significantly depend on the pooling design. In terms of the minimization of max^^s |r fl s|, we have 
the following theorem. 

Theorem 3. For fixed integers m, n and d we consider the optimization problem 

min max |rns|, subject to |r| = V G ^, (12) 

where Q consists ofm subsets o/{l, . . . , n}. Suppose that there exists a pooling design Q satisfying 
the constraint of (fT2l) and the condition that 

i) |r n s| = A or A — 1 for all t-seQ, r ^ s, 

ii) \{r E Q \ i & r}\ = k or k — Ifor all i = 1, . . . ,n, 

where A = max^^gg.^-^^ |r fl s| and k = [mrf/n]. Then the pooling design Q is an optimal solution 



Proof. For a fixed pooling design let fcj be /cj = |{r G ^ | i G r}|, that is ki stands for the number 
of pools including the clone i. Then we have the equality 

n n 

ki = md, ki{ki — 1) = |r fl s|. 

i=l i=l r,sS:G 

Since the mean value is less than or equal to the maximum value, we have 



1 , 

max |r n s| > — ; > |r fl s| 

r,s£g mym — 1) ^-^ m[m — 1) 

r^s 

Some calculation leads that the quadratic function Ym=i ^1 minimized at . . . , kn) = {k, . . . ,k,k- 
1, . . . ,k — 1) under the constraint that Xir=i ^« ~ integers ki, . . . ,kn- Thus for any pooling 

design Q, the objective function in (fT2l) is bounded below by (XliLi ^ ~ md)/m{m — 1) which 
depends only on n, m and d. For the pooling design Q satisfying (fT3]) . we have 

V , , 1 kf — md - 

A = max r n s > ^'=] ' — > A - 1. 

r,s&g m[m — 1) 

The last inequality comes from the facts that |r fl s| is equal to A or A — 1 and that there exists a pair 
r, s such that |r fl s | = A. Thus Q is an optimal design, since Q attains the least integer which is greater 
than or equal to the lower bound of the objective function. □ 

The pooling design called balanced incomplete block design (BIBD) has the property such that 
in conditions i) and ii) of Theorem [3] equalities |r fl s| = A and \{r ^ Q \ i ^ r}\ = k always hold. 
According to Theorem[3j a BIBD is an optimal solution in the sense that it has the maximum possible 
number of clones n for given number of pools m among the designs satisfying (fT2l) if it exists for 
specified n, m and d. 
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A BIBD is often called a 2-design. The existence condition an d the cons t ruction method of 



BIBD's have been intensively investigated in the field of combinatorics (IBeth et al.l . ll999l : IColboum and Dinitzi 
20071). Among them, constructions based on finite fields and finite geometries are well investigated. 
Also many recursive constructions or compositio n methods are developed. Tables of the existing 
BIBD's for small orders are listed in Chapter 2 of^Colboum a nd Pinitz ( 2007 ). The desi gns utilized 



in this paper are constructed based on Theorem 2 i mWilsoni (i 19721) . See also Lemma 6.3 in lBeth et al 
d 19991) for details. 



6 Numerical Experiments 

The bias correction is examined in some numerical experiments. In the experiment, we specify the 
number of clones (n), the number of pools (m) and the size of pool |r|, and then construct a pooling 
design Q satisfying the condition |r fl s| = A for any pair of pools r,s E G, where A is a prespecified 
constant. Then, the group test is conducted by using the pooling design. 

In numerical experiments, the number of clones is set to n = 24, 1314 or 1552, and the pooling 
design is prepared based on the balanced incomplete block design. Table |2] illustrates the pooling 
design for each simulation. Basically, the same BIB designs are combined to make larger pooling 
design. In order to build the pooling design such that any pair of clones is not assigned exactly the 
same pools, we applied randomization technique. The priori probability for each clone is defined 
as Pi{Xi = 1) = 0.1 for = 24 and Pi{xi) = 0.002 for n = 1314 and n = 1552. As shown in 
Table |3]the conditional probability of the observa tion, p(Sr = Sr\Zr = Zr), has been estimated by the 



1996b . and thus we use the probability 



experiments of an actual DNA library screening (|Knill et al 
in our algorithm. 

In the simulation, some positive clones are randomly chosen out of n clones, and the observations 
Sr G {0, 1, 2, 3}, r E G are generated according to the defined probability. The number of positive 
clones varies from one to four. Then, we estimate the marginal posterior probability pi (xj | s) . The esti- 
mated probability is cor npared to the true p osterior probability computed by the Markov Chain Monte 
Carlo (MCMC) method iKnill et al.l (| 19961) . Table |4] shows the estimated result in the descending or- 



der of the marginal posterior probability. In both methods almost the same clones are highly placed. 
Note that the MCMC method is computationally demanding. We use the MCMC method in order 
to obtain precise posterior probability which is used to assess the estimated (bias-corrected) poste- 
rior probabilit y. In the numerical experiments, we use Concave-Convex Procedure (CCCP) algorithm 
(lYuillel |2002|) to compute the posterior probability instead of the conventional loopy BP algorithm. 
The CCCP has the same extremal solution as the loopy BP algorithm, though the CCCP may have 
better convergence property. The computation time is shown in Table [51 The CCCP is compared with 
the MCMC method. Overall CCCP is efficient for large set of clones. We have confirmed that the 
computation time for bias correction is negligible. 

The bias -correction term | ^^^^s-r seg ^rsi is added to the estimated posterior probability given by 
CCCP. The accuracy of the estimator is measured by the KuUback-Leibler (KL) divergence. Let qi{xi) 
be the true posterior given by the MCMC method for n = 1314, 1552. For n = 24, the exact posterior 
probability is available, the discrepancy between qi and the estimated posterior qi for i = 1, . . . , n is 
measured by 



1 " 

n ^ ^ 



qi{xi\s) log 



qi{Xi\s) 
qi{xi\s)' 



=1 Xie{o,i} 

In the numerical simulation we conducted the estimation 1000 times with different random seed, and 
the KL-divergence is averaged over the repetition. 
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Table[6l Table|71 and Table[8]show the resuhs for each pooling design. The first column shows the 
number of positive clones out of n clones, and the second and third columns present the averaged KL- 
divergence for the estimator given by CCCP and its bias-corrected variant, respectively. When |r fl s| 
is less than three, the bias correction works well to improve the accuracy of the estimated posterior as 
shown in Table[6]and Table[8l Table|7]shows the result using the pooling design satisfying |r fl s| = 3. 
In this case, the bias-correction does not necessarily improve the estimator. This result indicates 
that not only the dominant bias term ^ B^si but also the higher order term will be necessary to 
improve the estimator. 

In the simple experiments, the bias correction may be useful to improve the estimated posterior 
when the pooling design Q satisfies |r ft s| = 2 for r, s G ^. 



7 Concluding Remarks 

For the pooling design we h ave proposed the bia s corrected estimator of the marginal posterior prob- 
ability based on the result of Ikeda et al. ( 2004al lbl). We analyzed an upper bound of the bias term and 
showed that BIB design will make the bias small comparing to other pooling designs. In numerical 
experiments, the bias correction works well to improve the marginal posterior, even when |r fl s| =2 
holds for the pooling design Q. We confirmed that the correction of the dominant bias term does not 
necessarily improve the estimator, when the pooling design satisfies |r fl s| =3. Investigating higher 
order bias correction will be an important future work. 



A Calculation of 5, „ 



Theorem [His obtained as a direct conclusion of Theorem 7 in llkeda et al.l (l2004a|) . Here, we compute 
Br si to show its explicit form, and verify that Brsi = for |r fl s| < 1. 

As shown in ^ and (flOl) . any function Cr{x) over {0, 1}" depending on only Xi,i E r is rep- 
resented by the linear sum of the functions having the form of HiGrl^i ~ '^0' where G [0, 1]. 
Moreover, the bias term Brsi is bilinear in Cr{x) — Cr and Cs{x) — Cs- Therefore, it is enough to con- 
sider the case that Cr and Cs are given as Cr{x) = Ylisri^i ~ '^ri) and Cs{x) = Ylissi^i ~ '^^i) for 
ari,asi G [0, 1]. 

Let us define er{r') for the subset r' C r by 

Let £'[ ■ ] be the expectation by the probability q{x; Oq), then we have E[cr] = er{r). Building blocks 
for the calculation of the bias term are given as follows. The matrix gij, gir and gir are given as 



i ^r, 

Xi{l - Xi)er{r \ {i}) i E r, 

gir Jo i^r, 



.'. 9i: 

9ii [e,.(r\|zjj lEr. 
The third tensor Tj^fc is computed as follows: 

E \ Xi{l - Xi){l -2xi)er{r\{i})es{s\{i}) iErHs, 

yd otherwise 
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Table 2: Balanced in complete block (BIB) designs used in the simulation and the prior probability 
are shown. In our context the conventional notation (v, r, b, k, A) for BIBD(t>, r, b, k, A) corresponds 
to (m, \r\,n,nm/\r\, \r fl s\) for r, s E G, r ^ s, where |r| and |r fl s\ take a constant number. 
The identical BIB designs are combined to make larger pooling design. When the base design is 
BIBD(w, r, b, k, A) and the repetition is t, the pooling design defined from BIBD(t;, r ■ z,b ■ t, k, \ ■ t) 
is constructed by combining the base design. In order to build the pooling design such that any pair 
of clones is not assigned exactly the same pools, we applied randomization technique. 



^clones 


base design 


repetition 


prior: pi{Xi = 1) 


n = 24 = 12 X 2 


BIBD(9,4, 12,3, 1) 


2 


0.1 


n = 1314 = 438 X 3 


B1BD(73,24,438,4,1) 


3 


0.002 


n = 1552 = 776 X 2 


BIBD(97,32,776,4,1) 


2 


0.002 



Table 3: The cond itional probability estimated by the experiments of an actual DNA library screening 



(iKnill etal.Lll996h . 





= 




= 0) 


= 0.871 


PiSr = 


0\Zr = 


1) 


= 0.05 


P{Sr 


= 1 




= 0) 


= 0.016 


PiSr = 


l\Zr = 


1) 


= 0.11 


P{Sr 


= 2 


z^ 


= 0) 


= 0.035 


P{Sr = 


2 


Zj- — 


1) 


= 0.27 


P{Sr 


= 3 


Zf 


= 0) 


= 0.078 


P{Sr = 


3 


Zf = 


1) 


= 0.57 



Table 4: Estimated posterior probability in the preliminary experiments. The estimated probability us- 
ing loopy BP algo rithm is compared to the true posterior computed by the Markov Chain Monte Carlo 
(MCMC) method lKnill et al.l (119961) . The result is shown in the descending order of the marginal pos- 
terior probability. In both methods almost the same clones are highly placed. 





loopy BP 


MCMC 


rank 


clone id 


posterior 


clone id 


posterior 


1 


336 


0.8393 


336 


0.8345 


2 


768 


0.0615 


768 


0.0628 


3 


125 


0.0574 


125 


0.0608 


4 


764 


0.0419 


81 


0.0400 


5 


81 


0.0409 


764 


0.0382 



Table 5: The computation time (second) is shown. The CCCP is compared with the MCMC method. 
Overall CCCP is efficient for the large set of clones. We have confirmed that the computation time 
for bias correction is negligible. 



n 


CCCP 


MCMC 


981 


0.22 


1.91 


1298 


0.27 


2.49 


3088 


0.81 


8.49 


6371 


1.68 


17.60 


10121 


3.33 


27.73 


30050 


11.09 


81.80 
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Table 6: The numerical results for pooling design such that n = 24, m = 9, |r| =8, |r fl s| = 2 are 
shown. The prior probability is set to Pi{Xi = 1) = 0.1 for alH = 1, . . . , n. The first column shows 
the number of positive clones out of n clones, and the second and third column presents the averaged 
KL-divergence for the CCCP and its bias-corrected variant from the posterior given by the MCMC 
method, respectively. 



# positive 


CCCP 


bias -corrected CCCP 


1 


11.67e-04 


6.67e-04 


2 


10.57e-04 


6.01e-04 


3 


7.020e-04 


4.26e-04 


4 


4.160e-04 


2.76e-04 



Table 7: The numerical results for pooling design such that n = 1314, m = 73, |r| = 72, |rns| =3 
are shown. The prior probability is set to Pi{Xi = 1) = 0.002 for alH = 1, . . . , n. 



# positive 


CCCP 


bias-corrected CCCP 


1 


3.80 e-05 


2.40 e-05 


2 


1.80 e-05 


1.80 e-05 


3 


10.1 e-05 


14.3 e-05 


4 


4.80 e-05 


5.20 e-05 



Table 8: The numerical results for pooling design such that n = 1552, m = 97, |r| = 64, |rns| =2 
are shown. The prior probability is set to Pi{Xi = 1) = 0.002 for alH = 1, . . . , n. 



# positive 


CCCP 


bias -corrected CCCP 


1 


0.90 e-05 


0.90 e-05 


2 


1.70 e-05 


1.50 e-05 


3 


3.30 e-05 


1.90 e-05 


4 


2.80 e-05 


2.60 e-05 
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For Tijr we see that Tjj> = when i ^ r or j ^ r holds. Then, we compute Tjj> for i,j G r. 
When i = j G r holds, we have 

Tiir = E[{Xi - XiYiCrix) - C,)] 



E[{xi - Xif{xi - ari)]er{r \ {i}) - Xi{l - Xi)er{r) 
Xi{l - Xi){l - 2xi)er{r \ {i}). 



In the same way, we obtain 



Xi{l - Xi){l -2xi)er{r\{i}) i=jer 
Xi{l - Xi)xj{l - Xj)er{r \ {i,j}) i,j er,ij^ j 



Note that 



holds. For i ^ r U s, the equality Tijg = T^jr = holds, and for r fl s = we have Tij^g^j. = 0. Thus, 

i ^rUs or rns = =^ X](^iir5'js + ^iis5'j>) = 

J 

holds. Then for r fl s 7^ we consider other two cases: i E r\s (or i E s\r) and i E r (1 s. Some 
calculation leads to 

iEr\s : ^ {T^jr-gjs + Tysgjr) = Xi{l - Xi) %(1 - Xj)er{r\{i,j})es{s\{j}), 



i E T D S . ^ ^ (Tijrgjs T^jg^jr^ 



= 2xi{l - Xi){l - 2xi)er{r\{i})es{s\{i}) 

+ Xi{l-Xi) Xj{l-Xj)[er{r\{i,j})es{s\{j}) + es{s\{i,j})er{r\{j})]. 



j£rns\{i} 



For Tirs, we see that 



i ^rUs or rns = ^ T^..^ = 
holds. Under the condition that r fl s 7^ 0, we consider other two cases: i E r\s and i E r Hs, 

i E r\s : 



er(r\(s U {i}))es{s\r) JJ - x^) + {xj - arj)ixj - a,j)} 



jGrCis 



er{r\{i})es{s) 



i E r n s : 



(1 - ttri - asi)er{r\s)es{s\r) Y\ {^ji^ - ^j) + (^i - (^rj){xj - asj)} 

jGrns\{i} 



^^(l Xi^ 

- er{r)es{s\{i}) - er{r\{i})es{s 
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We show Br.j below. Remember that 



Then, we have 



j,k=l j=l 



rns = or i ^ rf] s =^ Brsi = 0, 



since all terms vanish. To compute other cases, we use the formula, 



rnsi 



k=0 v<Zrr\s j£v 
\v\=k 



and so forth. Then B^si is represented as follows. 



rPls 



ier\s: Brsi = -Xi{l - Xi)^ ^ er{r\{v U {i}))es{s \v)Y\_Xj{l - Xj) 
i erds : Brsi = Xi{l - Xi) 



k=2 t)Crns j^v 
\v\=k 



(2xi - 1) ^ Xk{l - Xk)er{r\{i,k})es{s\{i,k}) 
fcerns\{j} 

|rns|-l 

(l-ari-a,i) ^ er{r\{{i]VJv))es{s\{{i]VJv))Wxj{l 

k=2 v(Zrns\{i} j£v 

\v\=k 



Xjj 



Paying attention to the summation, we see that B^si vanishes for |r fl s| < 1. 
B Upper bound of | Brsi \ 

First we derive an upper bound of \Brsi\ for Cr{x) = Yli^A^i ~ '^") ^^'^ Csix) = Yliesi^i ~ '^^i)- 
Suppose that there exists 6 such that < \xi — ai\ < 5 < 1 holds for alH = 1, . . . , n and all 
appearing in Cr and Cg. Then, we obtain an upper bound of |i?rsi|- We use — ajj) < 1/4. For 
i G r\s we have 

k=2 vcms ^ / ^ / 

|t)|=fc 



and in the same way for i G r fl s we obtain 



-^rsi *^ 



|rns|-l 1 



452 'V 452 



^|r| + |s|-2 / \ |rns|-l 
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Therefore, for any case we have 
Next we suppose that 

Cr{x) = hr + '^ bri Yli^i " «r-£i)> 
Cs{x) = hs + ^ bsi' Y\_i^i' - O'si'i')- 

e i'es 

Since Brsi is bilinear in Cr{x) — Cr and Cs{x) — Cs, we have 
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